NSF PAR Search | NSF Public Access Repository

Splitwise: Efficient Generative LLM Inference Using Phase Splitting

https://doi.org/10.1109/ISCA59077.2024.00019

Patel, Pratyush; Choukse, Esha; Zhang, Chaojie; Shah, Aashaka; Goiri, Íñigo; Maleki, Saeed; Bianchini, Ricardo (June 2024, IEEE)

Full Text Available

Coach: Exploiting Temporal Patterns for All-Resource Oversubscription in Cloud Platforms

https://doi.org/10.1145/3669940.3707226

Reidys, Benjamin; Zardoshti, Pantea; Goiri, Íñigo; Irvene, Celine; Berger, Daniel S.; Ma, Haoran; Arya, Kapil; Cortez, Eli; Stark, Taylor; Bak, Eugene; et al. (February 2025, ACM)

Free, publicly-accessible full text available February 3, 2026

Characterizing Power Management Opportunities for LLMs in the Cloud

https://doi.org/10.1145/3620666.3651329

Patel, Pratyush; Choukse, Esha; Zhang, Chaojie; Goiri, Íñigo; Warrier, Brijesh; Mahalingam, Nithish; Bianchini, Ricardo (April 2024, Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3)

Recent innovation in large language models (LLMs) and their myriad use cases have rapidly driven up the compute demand for datacenter GPUs. Several cloud providers and other enterprises plan to substantially grow their datacenter capacity to support these new workloads. A key bottleneck resource in datacenters is power, which LLMs are quickly saturating due to their rapidly increasing model sizes. We extensively characterize the power consumption patterns of a variety of LLMs and their configurations. We identify the differences between the training and inference power consumption patterns. Based on our analysis, we claim that the average and peak power utilization in LLM inference clusters should not be very high. Our deductions align with data from production LLM clusters, revealing that inference workloads offer substantial headroom for power oversubscription. However, the stringent set of telemetry and controls that GPUs offer in a virtualized environment makes it challenging to build a reliable and robust power management framework. We leverage the insights from our characterization to identify opportunities for better power management. As a detailed use case, we propose a new framework called POLCA, which enables power oversubscription in LLM inference clouds. POLCA is robust, reliable, and readily deployable. Using open-source models to replicate the power patterns observed in production, we simulate POLCA and demonstrate that we can deploy 30% more servers in existing clusters with minimal performance loss.
Full Text Available

Making Kernel Bypass Practical for the Cloud with Junction

Fried, Joshua; Chaudhry, Gohar Irfan; Saurez, Enrique; Choukse, Esha; Goiri, Íñigo; Elnikety, Sameh; Fonseca, Rodrigo; Belay, Adam (April 2024, 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI'24))

Kernel bypass systems have demonstrated order-of-magnitude improvements in throughput and tail latency for network-intensive applications relative to traditional operating systems (OSes). To achieve such excellent performance, however, they rely on dedicated resources (e.g., spinning cores, pinned memory) and require application rewriting. This is unattractive to cloud operators because they aim to densely pack applications, and rewriting cloud software requires a massive investment of valuable developer time. For both reasons, kernel bypass, as it exists, is impractical for the cloud. In this paper, we show these compromises are not necessary to unlock the full benefits of kernel bypass. We present Junction, the first kernel bypass system that can pack thousands of instances on a machine while providing compatibility with unmodified Linux applications. Junction achieves high density through several advanced NIC features that reduce pinned memory and the overhead of monitoring large numbers of queues. It maintains compatibility with minimal overhead through optimizations that exploit a shared address space with the application. Junction scales to 19–62× more instances than existing kernel bypass systems and can achieve similar or better performance without code changes. Furthermore, Junction delivers significant performance benefits to applications previously unsupported by kernel bypass, including those that depend on runtime systems like Go, Java, Node, and Python. In a comparison to native Linux, Junction increases throughput by 1.6–7.0× while using 1.2–3.8× fewer cores across seven applications.
Full Text Available

Faster and Cheaper Serverless Computing on Harvested Resources

https://doi.org/10.1145/3477132.3483580

Zhang, Yanqi; Goiri, Íñigo; Chaudhry, Gohar Irfan; Fonseca, Rodrigo; Elnikety, Sameh; Delimitrou, Christina; Bianchini, Ricardo (October 2021, SOSP '21: Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles)

Full Text Available
